
Conversation

@ryanaoleary (Contributor) commented Oct 14, 2025

Why are these changes needed?

This PR adds the label_selector option to the supported list of Actor options for a Serve deployment. Additionally, we add bundle_label_selector to specify label selectors for bundles when placement_group_bundles are specified for the deployment. These two options are already supported for Tasks/Actors and placement groups respectively.

Example use case:

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Meta-Llama-3-70B-Instruct",
        "model_source": "huggingface",
    },
    engine_kwargs=tpu_engine_config,
    resources_per_bundle={"TPU": 4},
    runtime_env={"env_vars": {"VLLM_USE_V1": "1"}},
    deployment_config={
        "num_replicas": 4,
        "ray_actor_options": {
            # In a GKE cluster with multiple TPU node-pools, schedule
            # only to the desired slice.
            "label_selector": {
                "ray.io/tpu-topology": "4x4" # added by default by Ray
            }
        }
    }
)
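
For reference, here is a rough sketch of the same targeting on a plain Serve deployment, combining both new options. The option name placement_group_bundle_label_selector follows the naming discussed later in this thread; treat the exact API surface as an assumption until the PR merges.

from ray import serve

@serve.deployment(
    num_replicas=4,
    ray_actor_options={
        # New in this PR: constrain each replica actor to matching nodes.
        "label_selector": {"ray.io/tpu-topology": "4x4"},
    },
    placement_group_bundles=[{"TPU": 4}, {"TPU": 4}],
    placement_group_strategy="STRICT_PACK",
    # New in this PR: per-bundle label selectors (a single selector is
    # broadcast to all bundles, see the follow-up commit discussed below).
    placement_group_bundle_label_selector=[
        {"ray.io/tpu-slice-name": "tpu-group-0"},
    ],
)
class TPUModel:
    async def __call__(self, request):
        ...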

The expected behavior of these new fields is as follows:

Pack scheduling enabled

PACK/STRICT_PACK PG strategy:

  • Standard PG without bundle_label_selector or fallback:

    • Sorts replicas by resource size (descending). Attempts to find the "best fit" node (minimizing fragmentation) that has available resources. Creates a Placement Group on that target node.
  • PG node label selector provided:

    • Same behavior as a standard placement group, but filters the list of candidate nodes to only those matching the label selector before finding the best fit.
  • PG node label selector and fallback:
    Same as above, but scheduling tries the following (see the sketch after this list):

    1. Tries to find a node matching the primary placement_group_bundles and bundle_label_selector.
    2. If no node fits, iterates through the placement_group_fallback_strategy. For each fallback entry, tries to find a node matching that entry's bundles and labels.
    3. If a node is found, creates a PG on it.
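
A minimal sketch of that candidate order, with assumed helper and data-structure names (illustrative only, not the actual Serve scheduler code):

from dataclasses import dataclass, field

@dataclass
class Node:
    labels: dict = field(default_factory=dict)
    available: dict = field(default_factory=dict)  # e.g. {"TPU": 4}

def _matches(node, selector):
    # Every requested label key/value must be present on the node.
    return all(node.labels.get(k) == v for k, v in (selector or {}).items())

def _fits(node, resources):
    return all(node.available.get(k, 0) >= v for k, v in resources.items())

def pick_target_node(nodes, resources, label_selector, fallback_strategy=None):
    # Try the primary selector first, then each fallback entry in order.
    candidates = [(resources, label_selector or {})]
    for fb in fallback_strategy or []:
        candidates.append((resources, fb.get("label_selector") or {}))
    for req, labels in candidates:
        feasible = [n for n in nodes if _matches(n, labels) and _fits(n, req)]
        if feasible:
            # "Best fit": pick the matching node with the least spare capacity.
            return min(feasible, key=lambda n: sum(n.available.values()))
    return None  # nothing schedulable; the replica stays pending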

SPREAD/STRICT_SPREAD PG strategy:

  • If any deployment uses these strategies, the global logic falls back to "Spread Scheduling" (see below)

Spread scheduling enabled

  • Standard PG without bundle_label_selector or fallback:
    • Creates a Placement Group via Ray Core without specifying a target_node_id. Ray Core decides placement based on the strategy.
  • PG node label selector provided:
    • Serve passes the bundle_label_selector to the CreatePlacementGroupRequest. Ray Core handles the soft/hard constraint logic during PG creation.
  • PG node label selector and fallback:
    • Serve passes the bundle_label_selector to the CreatePlacementGroupRequest. fallback_strategy is not yet supported in the placement group options, so this field isn't passed or considered; it's only used in the "best fit" node selection logic, which is skipped for Spread scheduling (see the sketch below).
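
A rough sketch of that Spread path, assuming Ray Core's placement group API accepts bundle_label_selector as described in this PR (the fallback strategy is intentionally absent because it isn't forwarded):

import ray
from ray.util.placement_group import placement_group

ray.init()

# Serve would create the PG without a target node; Ray Core applies the
# per-bundle label selectors while placing the bundles.
pg = placement_group(
    bundles=[{"TPU": 4}, {"TPU": 4}],
    strategy="SPREAD",
    bundle_label_selector=[
        {"ray.io/tpu-topology": "4x4"},
        {"ray.io/tpu-topology": "4x4"},
    ],
)
ray.get(pg.ready())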

Related issue number

#51564

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary ryanaoleary marked this pull request as ready for review October 14, 2025 07:08
@ryanaoleary ryanaoleary requested a review from a team as a code owner October 14, 2025 07:08
@ryanaoleary (Contributor, Author) commented:

cc: @MengjinYan @eicherseiji I think this change can help enable TPU use-cases with Ray LLM, since it'll allow users to target the desired slice/topology based on labels like these:

ray.io/accelerator-type: TPU-V6E
ray.io/node-group: tpu-group
ray.io/node-id: 0870a6a06413aed6079c15eeaa4f61e8a1413fa6140fc70c93608505
ray.io/tpu-pod-type: v6e-16
ray.io/tpu-slice-name: tpu-group-0
ray.io/tpu-topology: 4x4
ray.io/tpu-worker-id: '2'
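
For example, the label selector support that already exists for Tasks/Actors can target one of these labels directly; a small sketch (assuming the Ray Core label selector API referenced in this PR):

import ray

@ray.remote(
    resources={"TPU": 4},
    # Only schedule on nodes that belong to the targeted slice.
    label_selector={"ray.io/tpu-slice-name": "tpu-group-0"},
)
class TPUWorker:
    def ping(self):
        return "ok"

worker = TPUWorker.remote()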


@ray-gardener ray-gardener bot added serve Ray Serve Related Issue community-contribution Contributed by the community labels Oct 14, 2025
@eicherseiji eicherseiji self-requested a review October 14, 2025 23:21
@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label Oct 15, 2025
@MengjinYan (Contributor) left a comment:


The current change looks good.

At the same time, I have a general question, probably also for @eicherseiji: I'm not familiar with the Serve codebase, but from looking at the code, the change in this PR doesn't seem to cover the whole code path from the replica config to actually creating the placement group where we need to apply the bundle label selector (e.g. CreatePlacementGroupRequest, DeploymentTargetState, DeploymentVersion, etc.).

Wondering if we should include the change to the rest of the code path in this PR as well?

@eicherseiji (Contributor) commented:

Hi @ryanaoleary! Seems like it may be useful, but could you go into more detail about the problem this solved for you? I.e. is it typical to have multiple TPU node pools/heterogeneous compute in a TPU cluster?

Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary (Contributor, Author) replied:

Hi @ryanaoleary! Seems like it may be useful, but could you go into more detail about the problem this solved for you? I.e. is it typical to have multiple TPU node pools/heterogeneous compute in a TPU cluster?

The use case would be for TPU multi-slice, where it's important that the Ray nodes reserved with some group of resources (i.e. a bundle of TPUs) are part of the same actual TPU slice so that they can benefit from the high-speed ICI interconnects. There are GKE webhooks that set some TPU information as env vars when the Pods are created, including the topology and a unique identifier for the slice, which we set as Ray node labels. They also inject podAffinities so that the TPU Pods are scheduled to the same node-pool (i.e. TPU slice) and co-located. So if we then use those labels for scheduling in the application code, we can guarantee that the workers are running on co-located TPU devices.

For a RayCluster with multiple TPU slices (of the same topology or different ones), we currently only schedule using TPU: <number-resources> and/or the TPU generation (e.g. TPU-V6E), which can result in the placement group spanning multiple TPU slices.

@eicherseiji (Contributor) commented Oct 16, 2025

@ryanaoleary I see, thanks! Is there a reason we can't extend/re-use the resources_per_bundle concept for this instead?

@MengjinYan (Contributor) commented:

Hi @ryanaoleary! Seems like it may be useful, but could you go into more detail about the problem this solved for you? I.e. is it typical to have multiple TPU node pools/heterogeneous compute in a TPU cluster?

In addition to the TPU support, in general we want to have label support in all library APIs so that users can do scheduling based on node labels as well.

@ryanaoleary (Contributor, Author) replied:

@ryanaoleary I see, thanks! Is there a reason we can't extend/re-use the resources_per_bundle concept for this instead?

I think configuring resources_per_bundle with a TPU resource alone wouldn't work; we'd have to add a matching custom resource that denotes the TPU slice name to Pods of the same slice and use it as a workaround. That would also work, but moving away from this workaround was one of the rationales for adding label selectors. Additionally, we already set these Ray node labels for TPUs by default, so it'd be less work for the user to get it working.
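
To make the contrast concrete, a short sketch of the workaround versus the label-based approach (the synthetic resource name is illustrative and would have to be configured per slice):

# Workaround: advertise a synthetic per-slice resource on every node of the
# slice, then request it in each bundle to force co-location.
resources_per_bundle = {"TPU": 4, "tpu-slice-tpu-group-0": 1}

# Label selector: reuse the node labels Ray already sets for TPUs by default.
bundle_label_selector = [{"ray.io/tpu-slice-name": "tpu-group-0"}]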

@github-actions (bot) commented:

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 31, 2025
@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Nov 4, 2025
@abrarsheikh (Contributor) commented:

@ryanaoleary I will review this PR next week

@ktyxx (Contributor) commented Dec 9, 2025

Thanks for working on this! Really looking forward to using label selectors in Serve.
One thing I was wondering: would it be possible to also expose fallback_strategy from Ray Core?
My use case is that I'd like to use labels as a "preference" rather than a strict requirement. For example, I want to prioritize deploying to nodes with a specific label, but if none are available, just fall back to any node (like the default scheduling behavior). I actually opened an issue about this before (#59055), initially thinking about using NodeLabelSchedulingStrategy with hard/soft params, but was told it's being deprecated in favor of label selectors. So it'd be great if Serve could also expose fallback_strategy to support soft constraints via the fallback_strategy=[{"label_selector": {}}] pattern.
What do you think?
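
To illustrate the request, this is roughly what the soft-constraint pattern could look like if Serve exposed fallback_strategy in ray_actor_options (hypothetical; not part of this PR):

ray_actor_options = {
    # Prefer nodes on the desired slice topology...
    "label_selector": {"ray.io/tpu-topology": "4x4"},
    # ...but fall back to any node if none match (an empty selector matches all nodes).
    "fallback_strategy": [{"label_selector": {}}],
}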

Signed-off-by: ryanaoleary <[email protected]>
# Actor: Use Actor label selector
if "label_selector" in scheduling_request.actor_options:
    primary_labels = [
        scheduling_request.actor_options["label_selector"] or {}
Reviewer (Contributor):

is None allowed in actor_options["label_selector"]?

@ryanaoleary (Contributor, Author) replied:

Yeah, None is currently allowed there; it defaults to an empty dict {}, which results in no label constraints.

Comment on lines +819 to +822
fallback_labels = [fallback.get("label_selector", {}) or {}]
placement_candidates.append(
    (scheduling_request.required_resources, fallback_labels)
)
Reviewer (Contributor):

can there be a fallback with label_selector == None?

Also should we skip adding to placement_candidates if fallback_labels is empty?

@ryanaoleary (Contributor, Author) replied:

Yeah, there can be a fallback with label_selector == None; it is treated as an empty dictionary, which means no label constraints. If a user adds a fallback strategy with no labels, resulting in fallback_labels being empty, I think we still need to add it to placement_candidates so that it tries the "match all nodes regardless of labels" behavior.

ryanaoleary and others added 3 commits January 13, 2026 09:18
Co-authored-by: Abrar Sheikh <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: ryanaoleary <[email protected]>

Trim whitespace and fix invalid test case

Signed-off-by: ryanaoleary <[email protected]>
@ryanaoleary (Contributor, Author) commented:

@abrarsheikh I believe I've resolved all the outstanding comments.

Also, in 21bd155 I added some logic so that if a single placement_group_bundle_label_selector is provided but multiple bundles are provided, we apply that label selector uniformly to all bundles rather than requiring their lengths to be equal (sketch below). I think this is a good quality-of-life improvement for developers, since it could be tedious to copy-paste identical selectors for each bundle. This also matches the implementation in Ray Train's ScalingConfig.
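
A sketch of that broadcasting behavior (variable names assumed for illustration):

bundles = [{"TPU": 4}, {"TPU": 4}, {"TPU": 4}]
bundle_label_selector = [{"ray.io/tpu-slice-name": "tpu-group-0"}]

# A single selector is applied uniformly to every bundle instead of requiring
# len(bundle_label_selector) == len(bundles).
if len(bundle_label_selector) == 1 and len(bundles) > 1:
    bundle_label_selector = bundle_label_selector * len(bundles)

assert len(bundle_label_selector) == len(bundles)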

@abrarsheikh (Contributor) left a comment:


Looks good to me. Left a few nits.

In python/ray/serve/tests/test_deployment_scheduler.py, please include tests for the non-happy path, e.g. when no label selectors can be satisfied; I'm especially interested in the PACK strategy case.

@ryanaoleary (Contributor, Author) commented Jan 15, 2026

Created a bug fix for an issue I found with bundle_label_selector and STRICT_PACK placement groups when testing this PR. It shouldn't block this change, though, because we filter in the Serve scheduler based on labels: #60170

@ryanaoleary (Contributor, Author) commented:

Added two tests in a9f1954 that cover the unhappy path cases where the label selector or fallback strategy are unschedulable.

@abrarsheikh (Contributor) left a comment:


I will let @MengjinYan approve, and then I can merge

    return 0

cdef extern from * nogil:
    """
Reviewer (Contributor):

I think it would be better to add the setLabels() function in the cluster_resource_data.h C++ header to avoid embedding code here.
